ABSTRACT

Gender differences in achievement test performance at 
the college level were studied as part of the initial analysis of the 
recently developed Collegiate Assessment of Academic Proficiency 
(CAAP) — an achievement test battery for use in higher education. The 
CAAP was pilot tested in fall 1988 and includes a measure of writing 
proficiency and four objective tests: (1) reading; (2) writing 
skills; (3) mathematics; and (4) critical thinking. Random samples of 
1,000 males and 1,000 females were drawn for each objective test from 
students from about 100 post-secondary institutions. The students 
tested were primarily incoming college freshmen. The total number of 
examinees (n=1,490) was included in the writing analysis. The means,
standard deviations, t-statistics, and effect sizes for each test 
indicated that females tended to perform better than males on the 
multiple-choice writing skills test and on the essay test. Males 
tended to outperform females on the mathematics test. There were no 
overall performance differences between males and females on the 
reading and critical thinking tests, but there were differences
associated with individual passages. Effect sizes for performance 
differences were generally small, but were in keeping with results 
found for other tests and examinee populations. Two data tables are 
provided. (SLD) 

Concern about equity with respect to men and women has generated
considerable interest in educational achievement. Differences in the
educational backgrounds and achievement of the two groups are likely to
contribute to disparities in the allocation of cognitively demanding roles in
our society. Consequently, group differences in relevant test scores are
cause for concern. The focus of this study is on the measurement of gender
differences in achievement test performance at the college level.

Differences in performance patterns on standardized test batteries have
frequently been found for males and females. Stanley and his colleagues
(Brody, 1987; Dauber, 1987; Lupkowski, 1987; Stanley, 1987) investigated
gender differences on some 82 nationally standardized tests. To measure the
size of differences in mean scores, they used Cohen's (1977) concept of effect
size (mean score differences in standard units). Fairly large effect sizes
(.50 to .90) were found for aptitude tests and for advanced achievement tests
such as the advanced tests of the Graduate Record Examinations. Effect sizes
were smaller for other standardized achievement tests, including college 
admissions tests. Recent ACT assessment data (ACT, 1988) yielded an effect 
size of .23 in English Usage favoring females and effect sizes of .22 (Social 
Studies Reading), .33 (Mathematics), and .38 (Natural Sciences Reading) 
favoring males. 

Gender differences found on the ACT are generally consistent with those 
found for the SAT. A possible inconsistency is that females do better than 
males on the ACT English Usage Test, but that males do better than females on 
the SAT-Verbal (Clark & Grandy, 1984). However, the SAT-Verbal includes some 
scientific and technical reading items on which females do substantially less
well than males (Wendler & Carlton, 1987). This effect is consistent with
performance differences favoring males on the ACT Natural Sciences Reading 
Test. 

ACT has recently developed the Collegiate Assessment of Academic 
Proficiency (CAAP) as a new achievement test battery for use in higher
education. In Fall 1988, CAAP was pilot tested on a national sample of 
college students. This research was done as part of the initial analysis of 
CAAP data and had as its focus the investigation of gender differences in test 
performance. 

Methodology 

The Instrument 

CAAP has been developed as a test battery with components directed toward 
the measurement of academic skills typically attained in the first two years 
of college. The various tests in the CAAP battery are each 40 minutes in
length and can be used independently or in any configuration. No overall
composite score is offered. 

As configured in Fall 1988, CAAP included four objective tests (Reading,
Writing Skills, Mathematics, and Critical Thinking) and a direct measure of
writing proficiency.

The Reading Test measures student achievement in reading comprehension 
using questions based on reading selections in prose fiction, the humanities, 
the social sciences, and the natural sciences. Each form of the 36-item test 
contains four reading passages that are representative of the kinds of texts 
commonly encountered in college and university curricula. 



Each passage is accompanied by a set of multiple-choice questions that 
require students to derive meaning, manipulate information, cite comparisons, 
make generalizations, and draw conclusions. The test focuses on a complex set
of skills that students must use in comprehending written materials across a 
range of subject areas and purposes. 

The 72-item Writing Skills Test is an indirect measure of writing 
skill. The test requires examinees to analyze prose similar to that found in 
a typical course of college study. Several prose passages are included, each 
of which is accompanied by a sequence of multiple-choice test items measuring 
understanding of the conventions of standard written English and rhetorical 
skills such as strategy, organization, and style. To provide a variety of 
rhetorical situations, a range of discourse is employed.

The 35-item Mathematics Test measures the achievement of mathematical 
skills generally taught in first- or second-year college mathematics
courses. It emphasizes the solution of quantitative problems that are
encountered in many postsecondary algebra courses and also includes some
trigonometry and introductory calculus. The test emphasizes quantitative
reasoning rather than memorization of formulas, knowledge of techniques, or 
computational skills. 

The Critical Thinking Test measures the ability to clarify, analyze,
evaluate, and extend arguments. The test consists of 32 items related to
three passages that are representative of the kinds of issues commonly
encountered in a postsecondary curriculum. Each passage presents one or more 
arguments and may use one of a variety of formats, including case studies, 
debates, dialogues, overlapping positions, statistical arguments, experimental 
results, and editorials. 

The Writing (Essay) Test constitutes a direct approach to the measurement
of writing. Each form of the test consists of two independent writing 
prompts. The two prompts involve different issues and audiences, but each 
requires the examinee to formulate a clear thesis; support the thesis with an 
argument or reasons relevant to the issue, position taken, and audience; and 
present the argument in a well-organized, logical manner. 

As initially administered, each examinee received two scores per 
prompt. A "purpose" score reflected how well the examinees responded to the
task required by the situations described in the prompts. A "language usage"
score reflected the raters' impressions of the relative presence of usage or
mechanical errors and the degree to which such errors impeded the flow of 
thought in the essays. Each paper was scored on a 4-point scale for purpose 
and language usage, separately, by each of two raters working independently. 
The evaluations of both raters were averaged to obtain the purpose and 
language usage scores for each prompt. Additionally, the scores for the two
prompts were averaged to yield a composite purpose score and a composite 
language usage score on a scale of 1.0 to 4.0. 
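
The averaging scheme described above can be sketched in a few lines. This is only an illustration of the arithmetic, not ACT's scoring software: the function names and the ratings are hypothetical, and the sketch assumes a simple unweighted mean at both steps, which is what the description suggests.

```python
from statistics import mean

def prompt_score(rater_a, rater_b):
    """Average two independent ratings (4-point scale) for one prompt."""
    for rating in (rater_a, rater_b):
        if not 1 <= rating <= 4:
            raise ValueError("ratings must lie on the 4-point scale (1 to 4)")
    return mean([rater_a, rater_b])

def composite_score(prompt1_ratings, prompt2_ratings):
    """Average the two prompt-level scores into a composite (1.0 to 4.0)."""
    return mean([prompt_score(*prompt1_ratings), prompt_score(*prompt2_ratings)])

# Hypothetical purpose ratings: raters gave 3 and 2 on Prompt 1, 2 and 2 on Prompt 2.
purpose_composite = composite_score((3, 2), (2, 2))
print(purpose_composite)
```

The same two functions would serve for the language usage score, since both scores are described as being built the same way.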

Data Source

CAAP was pilot tested in Fall 1988 on a national sample of students from 
about 100 postsecondary institutions. The sample included a variety of
institutions, two- and four-year, public and private. The sample was not, 
however, designed to be nationally representative. The involved institutions 
were simply a sample of those interested in the CAAP program and able to begin 
a new testing program in the fall. The students tested at these institutions 
were primarily incoming college freshmen. 

Random samples of 1,000 males and 1,000 females for each objective test
were drawn for analysis. Because the total number of examinees given the 
Writing (Essay) Test was not large, all of these students were included in the 
essay analyses. 

Analyses

Mean performance for males and females was compared on all CAAP tests and 
for each provided essay score. To investigate possible passage effects, mean
performances were also compared for each passage-related set of items in the 
Reading and Critical Thinking tests. T-tests were run and effect sizes were 
calculated to assist in evaluating group differences in performance. 
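
As a rough illustration of these two statistics, the sketch below computes an effect size and a t-statistic from summary values alone. It assumes the effect size is the male mean minus the female mean divided by a pooled standard deviation, and that the t-test is the equal-variance, independent-samples version; the paper does not spell out either choice, so both are assumptions. The function names are hypothetical.

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

def effect_size(mean1, sd1, n1, mean2, sd2, n2):
    """Mean difference in pooled-SD units (positive favors group 1)."""
    return (mean1 - mean2) / pooled_sd(sd1, n1, sd2, n2)

def t_statistic(mean1, sd1, n1, mean2, sd2, n2):
    """Equal-variance independent-samples t computed from summary statistics."""
    standard_error = pooled_sd(sd1, n1, sd2, n2) * math.sqrt(1 / n1 + 1 / n2)
    return (mean1 - mean2) / standard_error

# Mathematics Test summary values from Table 1 (males first, females second).
math_args = (16.18, 4.41, 1000, 15.34, 3.82, 1000)
print(round(effect_size(*math_args), 2), round(t_statistic(*math_args), 2))
```

With the Mathematics values, the effect size comes out at about .20, in line with Table 1. The t-statistic comes out slightly below the tabled 4.58, which is plausible given that the published means and standard deviations are rounded.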

Finally, the Mantel-Haenszel (M-H) procedure (Holland & Thayer, 1986) was 
used at the individual item level of the objective tests to measure gender- 
based differential item performance. The intent of these analyses was to 
identify categories of items that seemed to be operating differently for the 
two groups. 
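
For readers unfamiliar with the procedure, the following is a minimal sketch of the M-H delta statistic as it is usually defined: examinees are matched on total score, a common odds ratio is pooled across the score levels, and the result is rescaled as -2.35 times its natural logarithm. The counts are hypothetical, and treating one group as the "reference" and the other as the "focal" group is a labeling assumption of this sketch, not a detail taken from the study.

```python
import math

def mh_delta(strata):
    """Mantel-Haenszel delta for one item.

    strata: iterable of (ref_right, ref_wrong, focal_right, focal_wrong)
    counts, one tuple per matched total-score level.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # an empty score level contributes nothing
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if numerator == 0 or denominator == 0:
        raise ValueError("common odds ratio is undefined for these counts")
    common_odds_ratio = numerator / denominator
    return -2.35 * math.log(common_odds_ratio)

# Hypothetical counts at two score levels; the reference group does better here.
print(mh_delta([(40, 10, 30, 20), (25, 25, 20, 30)]))
```

Under this convention a negative delta means the item was relatively easier for the reference group among examinees of comparable total score, and a delta of zero means no difference.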

Results 

Table 1 presents the means, standard deviations, t-statistics, and effect 
sizes found for each test. These results indicate that females tended to 
perform better than males on the multiple choice Writing Skills Test and on 
the essay test; males tended to outperform females on the Mathematics Test. 

Although no overall performance differences were found between males and 
females on the Reading and Critical Thinking tests, there were notable 
differences associated with individual passages. Females performed relatively
well on the items associated with Reading Passage 2 (art topic), and males
performed relatively well on the items associated with Reading Passage 1
(scientific context) and Critical Thinking Passage 2 (scientific context).



In terms of the magnitude of performance differences on the tests, effect 
sizes were generally small. The Mathematics (favoring males) and the Writing 
Skills (favoring females) effect sizes were .20 and -.28, respectively. The 
effect sizes for the Writing (Essay) composite scores were larger: -.41 for 
the purpose score and -.32 for the language usage score, both scores favoring 
females. 

Mantel-Haenszel procedures were used, but not in the typical sense of 
identifying individual items for differential item performance. This use of
differential item performance methodology was exploratory in nature, intended 
to look for categories of items that favored either males or females. A 
summary of these exploratory analyses is presented in Table 2. The numbers of 
items meeting a very relaxed flagging criterion (± 0.2 on the M-H delta) are
presented for various subcategories of items. The numbers of items in each
category that seem to favor males or females are shown.
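
The tallying step behind such a summary can be sketched as follows. The item deltas are invented for illustration, and the sign convention (negative delta favoring males, positive favoring females) is an assumption of the sketch that would hold only if males were treated as the reference group; the threshold of 0.2 is the relaxed criterion mentioned above.

```python
from collections import defaultdict

FLAG_THRESHOLD = 0.2  # relaxed criterion on the absolute M-H delta

def tally_by_category(items, threshold=FLAG_THRESHOLD):
    """Count items per category and how many are flagged for each group.

    items: iterable of (category, delta) pairs.
    Returns {category: [total_items, favoring_males, favoring_females]}.
    """
    counts = defaultdict(lambda: [0, 0, 0])
    for category, delta in items:
        row = counts[category]
        row[0] += 1
        if delta <= -threshold:
            row[1] += 1  # assumed convention: negative delta favors males
        elif delta >= threshold:
            row[2] += 1  # assumed convention: positive delta favors females
    return dict(counts)

# Hypothetical deltas for five items in two categories.
example_items = [
    ("Grammar", -0.35), ("Grammar", 0.41), ("Grammar", 0.05),
    ("Organization", -0.60), ("Organization", -0.25),
]
for category, (total, males, females) in tally_by_category(example_items).items():
    print(category, total, males, females)
```

Because the criterion is so loose, counts of this kind are only useful for spotting patterns across a category, not for judging any single item.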

Generally the results in Table 2 portray seemingly random distributions 
of items favoring males or females in the various subcategories. However, the 
strong, apparently nonrandom patterns of items favoring males in Passage 1 of 
the Reading Test and Passage 2 of the Critical Thinking Test are consistent 
with the subscore results presented in Table 1. Also, the pattern of items 
favoring females in Passage 2 of the Reading Test is consistent with 
Table 1. Results for several other item categories were suggestive: 

• Writing Skills "grammar" items: 4 items favoring females to 1 item favoring males;

• Writing Skills "sentence structure" items: 8 to 3 favoring females;

• Writing Skills "organization" items: 6 to 1 favoring males;

• Mathematics "applications" items: 6 to 3 favoring males.



Discussion 

The outcomes of this study with the CAAP tests are generally consistent 
with results found for other tests and examinee populations. The effect size 
of .20 found for the CAAP Mathematics Test was smaller than that found in 
other research with different programs, but consistent with them in showing 
higher scores for males. Although the patterns of differential performance 
for several of the mathematics categories in Table 2 seem consistent with 
previous research (Doolittle & Cleary, 1987; Doolittle, in press; Marshall, 
1984), additional research would be necessary to substantiate these 
relationships. 

Effect sizes favoring females for the writing instruments (multiple 
choice, -.28; essay, -.41 and -.32) were also generally consistent with
previous findings. However, the differences for the essay scores were somewhat
larger than expected, based on the results with the multiple choice Writing 
Skills Test. 

Finally, the results for the Reading and Critical Thinking tests are
interesting in that notable gender differences are found for items associated 
with specific passages, but not for the overall tests. Clearly, because of 
the limited number of passages examined here, further research with additional 
test forms would seem to be necessary. However, it appears that, consistent
with Wendler and Carlton (1987), females may do better than males with items 
based on humanities reading passages, but poorer than males on items 
associated with science-oriented passages. (It is important to note that the 
test items do not directly measure knowledge of the content of associated 
passages, but rather reading, understanding, or reasoning within context.) 

It appears that the performance differences between males and females
found with CAAP are similar to those found with other achievement tests and
populations. Clearly, when group means usually differ by less than half a
standard deviation, there is considerable overlap in score
distributions. However, these seem to be stable, group-level differences that
are observed in many testing situations.
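
To give a sense of how much overlap a given effect size implies, the sketch below uses the overlapping coefficient for two normal distributions with equal variances, 2Φ(-|d|/2), where Φ is the standard normal cumulative distribution function. Normality and equal variances are assumptions made here for illustration only; the study does not report the shape of the score distributions.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def overlap_coefficient(d):
    """Shared area of two equal-variance normal curves whose means differ by d SDs."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)

# Effect sizes of the magnitude reported for CAAP, plus the half-SD benchmark.
for d in (0.20, 0.28, 0.41, 0.50):
    print(f"d = {d:.2f}: overlap = {overlap_coefficient(d):.2f}")
```

Under these assumptions, even a half standard deviation difference leaves roughly 80 percent of the two distributions in common, and the smaller effect sizes found here leave more.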

Differential background, interests, and even demographic factors related
to male and female examinee groups may be relevant for an accurate
interpretation of group differences in test performance. But to the extent
that the differences are real (on content that is a significant part of
achievement as measured by CAAP or other achievement tests), they are simply
indications of the differential achievement of students.

TABLE 1

Mean Comparisons of Male and Female Examinees

                                   Males                Females
Test/Score                      N    Mean   S.D.      N    Mean   S.D.        t   Prob.  Effect Size
Reading                      1000   20.42   6.52   1000   20.46   6.12     -.14    .889         -.01
  Passage 1                          5.62   1.92           5.24   1.83     4.44    .000          .20
  Passage 2                          5.50   2.08           5.76   1.93    -2.92    .004         -.13
  Passage 3                          4.71   2.21           4.77   2.15     -.58    .559         -.03
  Passage 4                          4.59   2.74           4.68   2.63     -.77    .444         -.03
Writing Skills               1000   43.39  13.65   1000   47.09  12.43    -6.31    .000         -.28
Mathematics                  1000   16.18   4.41   1000   15.34   3.82     4.58    .000          .20
Critical Thinking            1000   19.29   5.15   1000   18.92   5.07     1.60    .110          .07
  Passage 1                          7.32   2.00           7.15   1.94     1.94    .050          .09
  Passage 2                          6.41   2.32           6.05   2.33     3.45    .001          .15
  Passage 3                          5.57   2.26           5.71   2.15    -1.68    .094         -.06
Writing (Essay)              1490                  2282
  Prompt 1 Purpose                   2.50    .79           2.79    .79   -11.11    .000         -.36
  Prompt 1 Lang. Usage               2.62    .67           2.82    .66    -8.98    .000         -.30
  Prompt 2 Purpose                   2.16    .81           2.42    .85    -9.27    .000         -.31
  Prompt 2 Lang. Usage               2.60    .68           2.81    .66    -9.10    .000         -.31
  Purpose (Composite)                2.33    .66           2.61    .68   -12.33    .000         -.41
  Lang. Usage (Composite)            2.61    .62           2.81    .61    -9.84    .000         -.32

Table 2

Differential Item Performance (by Favored Group)
for Item/Passage Categories*

                                         Total          Number            Number
Test                Subcategory          Items  Favoring Males  Favoring Females
Reading             Referring                8               2                 4
                    Reasoning               28              10                 9
                    Passage 1                9               7                 0
                    Passage 2                9               1                 5
                    Passage 3                9               3                 3
                    Passage 4                9               1                 5
Writing Skills      Grammar                  8               1                 4
                    Sentence Structure      18               3                 8
                    Organization            10               6                 1
                    Style                   14               6                 3
                    Strategy                16               5
                    Punctuation              6               2                 1
                    Passage 1               12               4
                    Passage 2               12               5                 4
                    Passage 3               12               2                 4
                    Passage 4               12               4                 2
                    Passage 5               12               3                 4
                    Passage 6               12               5                 5
Mathematics         Pre-Algebra              7               4                 2
                    Algebra                 20               5                 7
                    Trig./Calculus           8               2                 1
                    Basic Skills            24               5                 7
                    Applications            11               6                 3
Critical Thinking   Analysis                16               4                 5
                    Evaluation              17               3                 1
                    Extension                9               3                 3
                    Passage 1               11               3                 2
                    Passage 2               11               6                 2
                    Passage 3               10               1                 5


*An extremely loose criterion on the Mantel-Haenszel delta statistic (Holland & Thayer, 
1986) was used to identify items performing differently for males and females. 
Although this criterion was far too loose for evaluating individual items, it was used 
for exploratory purposes in evaluating trends for the various subcategories of items. 
This criterion flagged about 60% of the test items as favoring one group or the other.



